10 Minutes to pandas

#python #pandas

Python

pandas官方入门教程：10 Minutes to pandas

下面的操作依赖于这些包，所以操作之前要引入这些包

1
2
3

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. Object Creation 创建对象

将数组转换为Series对象

In [4]: s = pd.Series([1,3,5,np.nan,6,8])
In [5]: s
Out[5]: 
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

将numpy数组转换为DataFrame

pandas.date_range

创建一个datetime index，起始时间为20170320，时间跨度为7天，默认的freq为'D'。如果没有periods参数则必须指定结束时间end。

In [6]: dates = pd.date_range('20170320', periods=7)
In [7]: dates
Out[7]: 
DatetimeIndex(['2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23',
               '2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27'],
              dtype='datetime64[ns]', freq='D')
In [8]: dates = pd.date_range('20170320','20170327')
In [9]: dates
Out[9]: 
DatetimeIndex(['2017-03-20', '2017-03-21', '2017-03-22', '2017-03-23',
               '2017-03-24', '2017-03-25', '2017-03-26', '2017-03-27'],
              dtype='datetime64[ns]', freq='D')
# 更改freq
In[10]: dates = pd.date_range('20170320', periods=4,freq='w-mon')
In[11]: dates
Out[11]: 
DatetimeIndex(['2017-03-20', '2017-03-27', '2017-04-03', '2017-04-10'], dtype='datetime64[ns]', freq='W-MON')

np.random.randn

np.random.randn(4,4)生成一个numpy4*4的数组，其中的每一个数服从正态分布。
index=dates 将dates指定为index。
columns指定列名称

In[86]: df = pd.DataFrame(np.random.randn(4,4), index=dates, columns=list('ABCD'))
In[87]: df
Out[87]: 
                   A         B         C         D
2017-03-20 -0.273931  1.464139  0.511461  0.707071
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-04-10  0.963824 -1.089391  0.133069  1.007356

将字典对象转换成DataFrame

In[88]: df2 = pd.DataFrame({ 'A' : 1.,
   ....:                      'B' : pd.Timestamp('20130102'),
   ....:                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
   ....:                      'D' : np.array([3] * 4,dtype='int32'),
   ....:                      'E' : pd.Categorical(["test","train","test","train"]),
   ....:                      'F' : 'foo' })
In[89]: df2
Out[89]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

2. Viewing Data 查看数据

dtypes 查看各列的数据类型

In[94]: df2.dtypes
Out[94]: 
A     float64
B       int64
C     float32
D       int32
E    category
F      object
dtype: object
   
In[96]: df['A'].dtype
Out[96]: 
dtype('float64')

head(),tail(),查看数据框头几行和最后几行的数据

In[97]: df.head()
Out[97]: 
                   A         B         C         D
2017-03-20 -0.273931  1.464139  0.511461  0.707071
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-04-10  0.963824 -1.089391  0.133069  1.007356
In[98]: df.tail(2)
Out[98]: 
                   A         B         C         D
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-04-10  0.963824 -1.089391  0.133069  1.007356

查看索引df.index

1
2
3

In[99]: df.index
Out[99]: 
DatetimeIndex(['2017-03-20', '2017-03-27', '2017-04-03', '2017-04-10'], dtype='datetime64[ns]', freq='W-MON')

查看列df.colums

1
2
3

In[100]: df.columns
Out[100]: 
Index(['A', 'B', 'C', 'D'], dtype='object')

查看这个DataFrame下的numpy data:df.values

In[101]: df.values
Out[101]: 
array([[-0.27393096,  1.46413903,  0.5114613 ,  0.70707094],
       [ 0.29052687,  0.79957181,  0.52614137,  0.69645894],
       [ 0.39017244,  0.17137885, -1.32174227,  0.9413581 ],
       [ 0.9638237 , -1.08939135,  0.13306936,  1.00735588]])

df.describe:给出这个数据框的statistic summary

In[102]: df.describe()
Out[102]: 
              A         B         C         D
count  4.000000  4.000000  4.000000  4.000000
mean   0.342648  0.336425 -0.037768  0.838061
std    0.506953  1.087265  0.875104  0.159730
min   -0.273931 -1.089391 -1.321742  0.696459
25%    0.149412 -0.143814 -0.230634  0.704418
50%    0.340350  0.485475  0.322265  0.824215
75%    0.533585  0.965714  0.515131  0.957858
max    0.963824  1.464139  0.526141  1.007356

df.T:转置矩阵，转置后index也随之改变

In[103]: df.T
Out[103]: 
   2017-03-20  2017-03-27  2017-04-03  2017-04-10
A   -0.273931    0.290527    0.390172    0.963824
B    1.464139    0.799572    0.171379   -1.089391
C    0.511461    0.526141   -1.321742    0.133069
D    0.707071    0.696459    0.941358    1.007356
In[104]: df.T.index
Out[104]: 
Index(['A', 'B', 'C', 'D'], dtype='object')

df.sort_index():根据行列索引排序，默认axis=0，根据索引排序，当axis=1，根据列排序

In[119]: df.sort_index(ascending=True)
Out[119]: 
                   A         B         C         D
2017-03-20 -0.273931  1.464139  0.511461  0.707071
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-04-10  0.963824 -1.089391  0.133069  1.007356
In[120]: df.sort_index(ascending=False)
Out[120]: 
                   A         B         C         D
2017-04-10  0.963824 -1.089391  0.133069  1.007356
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-03-20 -0.273931  1.464139  0.511461  0.707071
In[122]: df.sort_index(axis=1, ascending=False)
Out[122]: 
                   D         C         B         A
2017-03-20  0.707071  0.511461  1.464139 -0.273931
2017-03-27  0.696459  0.526141  0.799572  0.290527
2017-04-03  0.941358 -1.321742  0.171379  0.390172
2017-04-10  1.007356  0.133069 -1.089391  0.963824
In[123]: df.sort_index(axis=1, ascending=True)
Out[123]: 
                   A         B         C         D
2017-03-20 -0.273931  1.464139  0.511461  0.707071
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-04-10  0.963824 -1.089391  0.133069  1.007356

df.sort_values:根据值排序

In[124]: df.sort_values(by='A')
Out[124]: 
                   A         B         C         D
2017-03-20 -0.273931  1.464139  0.511461  0.707071
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-04-10  0.963824 -1.089391  0.133069  1.007356
In[125]: df.sort_values(by='A',ascending=False)
Out[125]: 
                   A         B         C         D
2017-04-10  0.963824 -1.089391  0.133069  1.007356
2017-04-03  0.390172  0.171379 -1.321742  0.941358
2017-03-27  0.290527  0.799572  0.526141  0.696459
2017-03-20 -0.273931  1.464139  0.511461  0.707071

3. Selection

3.1 Getting

df.A，df[‘A’]：将A列的数据取出，两者的作用是相同的

In[10]: df.A
Out[10]: 
2017-03-20    0.387798
2017-03-27    0.221780
2017-04-03   -1.585317
2017-04-10    0.035160
Freq: W-MON, Name: A, dtype: float64
In[11]: df['A']
Out[11]: 
2017-03-20    0.387798
2017-03-27    0.221780
2017-04-03   -1.585317
2017-04-10    0.035160
Freq: W-MON, Name: A, dtype: float64

按行取数据

In[12]: df[0:3]
Out[12]: 
                   A         B         C         D
2017-03-20  0.387798 -0.058203 -0.169525 -0.612571
2017-03-27  0.221780 -1.170097  0.876672  0.552794
2017-04-03 -1.585317  2.253459  0.109537  0.028095
In[14]: df['2017-03-20':'2017-04-03']
Out[14]: 
                   A         B         C         D
2017-03-20  0.387798 -0.058203 -0.169525 -0.612571
2017-03-27  0.221780 -1.170097  0.876672  0.552794
2017-04-03 -1.585317  2.253459  0.109537  0.028095

3.2 Selection by Label

df.loc：按照索引和列取数据

df.at：和df.loc类似，但是只能取一个元素的值，只取一个元素的值，at是更为快速的方法

In[16]: df.loc['2017-03-20']
Out[16]: 
A    0.387798
B   -0.058203
C   -0.169525
D   -0.612571
Name: 2017-03-20 00:00:00, dtype: float64
                
In[17]: df.loc[dates[0]]
Out[17]: 
A    0.387798
B   -0.058203
C   -0.169525
D   -0.612571
Name: 2017-03-20 00:00:00, dtype: float64
        
In[34]: df.loc[:,'A']
Out[34]: 
2017-03-20    0.387798
2017-03-27    0.221780
2017-04-03   -1.585317
2017-04-10    0.035160
Freq: W-MON, Name: A, dtype: float64
          
In[36]: df.loc[dates[0:3],'A']
Out[36]: 
2017-03-20    0.387798
2017-03-27    0.221780
2017-04-03   -1.585317
Freq: W-MON, Name: A, dtype: float64
            
In[37]: df.loc[dates[0:3],['A','B']]
Out[37]: 
                   A         B
2017-03-20  0.387798 -0.058203
2017-03-27  0.221780 -1.170097
2017-04-03 -1.585317  2.253459
In[39]: df.loc[:,['A','B']]
Out[39]: 
                   A         B
2017-03-20  0.387798 -0.058203
2017-03-27  0.221780 -1.170097
2017-04-03 -1.585317  2.253459
2017-04-10  0.035160  0.143292
In[74]: df.at[dates[0],'A']
Out[74]: 
0.38779782953047309

3.3 Selection by Position

df.iloc：根据位置取数据，类似于matlab里面的取矩阵的方法

df.iat：和df.iloc类似，但是只能取一个元素的值

In[44]: df.iloc[3]
Out[44]: 
A    0.035160
B    0.143292
C    0.139537
D   -0.823278
Name: 2017-04-10 00:00:00, dtype: float64
                
In[47]: df.iloc[2:4]
Out[47]: 
                   A         B         C         D
2017-04-03 -1.585317  2.253459  0.109537  0.028095
2017-04-10  0.035160  0.143292  0.139537 -0.823278
In[43]: df.iloc[2:4,0:2]
Out[43]: 
                   A         B
2017-04-03 -1.585317  2.253459
2017-04-10  0.035160  0.143292
In[49]: df.iloc[:,2:4]
Out[49]: 
                   C         D
2017-03-20 -0.169525 -0.612571
2017-03-27  0.876672  0.552794
2017-04-03  0.109537  0.028095
2017-04-10  0.139537 -0.823278
In[50]: df.iloc[3,2]
Out[50]: 
0.13953667147016041
In[75]: df.iat[0,0]
Out[75]: 
0.38779782953047309

3.4 Boolean Indexing

In[52]: df[df.A > 0]
Out[52]: 
                   A         B         C         D
2017-03-20  0.387798 -0.058203 -0.169525 -0.612571
2017-03-27  0.221780 -1.170097  0.876672  0.552794
2017-04-10  0.035160  0.143292  0.139537 -0.823278
In[53]: df[df > 0]
Out[53]: 
                   A         B         C         D
2017-03-20  0.387798       NaN       NaN       NaN
2017-03-27  0.221780       NaN  0.876672  0.552794
2017-04-03       NaN  2.253459  0.109537  0.028095
2017-04-10  0.035160  0.143292  0.139537       NaN

isin()

In[67]: df2 = df.copy()
In[68]: df2['E'] = ['one', 'one','two','three']
In[69]: df2
Out[69]: 
                   A         B         C         D      E
2017-03-20  0.387798 -0.058203 -0.169525 -0.612571    one
2017-03-27  0.221780 -1.170097  0.876672  0.552794    one
2017-04-03 -1.585317  2.253459  0.109537  0.028095    two
In[70]: df2['E'].isin( ['one', 'one','two'])
Out[70]: 
2017-03-20     True
2017-03-27     True
2017-04-03     True
2017-04-10    False
Freq: W-MON, Name: E, dtype: bool
            
In[71]: df2[df2['E'].isin(['one', 'one','two'])]
Out[71]: 
                   A         B         C         D    E
2017-03-20  0.387798 -0.058203 -0.169525 -0.612571  one
2017-03-27  0.221780 -1.170097  0.876672  0.552794  one
2017-04-03 -1.585317  2.253459  0.109537  0.028095  two

4. Missing Data

缺失值：NaN

df.reindex：重置索引，原先不存在的索引所对应的值为NaN

df.dropna()：将缺失值删除

df.fillna()：填充缺失值

In[78]: df.reindex(index=pd.date_range('2017-03-27',periods=6,freq='W-MON'))
Out[78]: 
                   A         B         C         D
2017-03-27  0.221780 -1.170097  0.876672  0.552794
2017-04-03 -1.585317  2.253459  0.109537  0.028095
2017-04-10  0.035160  0.143292  0.139537 -0.823278
2017-04-17       NaN       NaN       NaN       NaN
2017-04-24       NaN       NaN       NaN       NaN
2017-05-01       NaN       NaN       NaN       NaN
In[79]: df3 = df.reindex(index=pd.date_range('2017-03-27',periods=6,freq='W-MON'))
In[80]: df3.dropna(how='any')
Out[80]: 
                   A         B         C         D
2017-03-27  0.221780 -1.170097  0.876672  0.552794
2017-04-03 -1.585317  2.253459  0.109537  0.028095
2017-04-10  0.035160  0.143292  0.139537 -0.823278
In[81]: df3.fillna(value=1)
Out[81]: 
                   A         B         C         D
2017-03-27  0.221780 -1.170097  0.876672  0.552794
2017-04-03 -1.585317  2.253459  0.109537  0.028095
2017-04-10  0.035160  0.143292  0.139537 -0.823278
2017-04-17  1.000000  1.000000  1.000000  1.000000
2017-04-24  1.000000  1.000000  1.000000  1.000000
2017-05-01  1.000000  1.000000  1.000000  1.000000
In[82]: pd.isnull(df3)
Out[82]: 
                A      B      C      D
2017-03-27  False  False  False  False
2017-04-03  False  False  False  False
2017-04-10  False  False  False  False
2017-04-17   True   True   True   True
2017-04-24   True   True   True   True
2017-05-01   True   True   True   True

5. Operations

5.1 Stats

df.mean()：默认按照索引取均值

In[83]: df.mean(0)
Out[83]: 
A   -0.235145
B    0.292113
C    0.239055
D   -0.213740
dtype: float64
In[84]: df.mean()
Out[84]: 
A   -0.235145
B    0.292113
C    0.239055
D   -0.213740
dtype: float64
In[85]: df.mean(1)
Out[85]: 
2017-03-20   -0.113125
2017-03-27    0.120287
2017-04-03    0.201443
2017-04-10   -0.126322
Freq: W-MON, dtype: float64

df.sub 减法,df.add 加法

In[112]: df.sub(df['A'],axis=0)
Out[112]: 
              A         B         C         D
2017-03-20  0.0 -0.446001 -0.557323 -1.000369
2017-03-27  0.0 -1.391877  0.654892  0.331014
2017-04-03  0.0  3.838775  1.694853  1.613412
2017-04-10  0.0  0.108132  0.104377 -0.858438
In[113]: df.add(df['A'],axis=0)
Out[113]: 
                   A         B         C         D
2017-03-20  0.775596  0.329595  0.218273 -0.224773
2017-03-27  0.443560 -0.948317  1.098452  0.774574
2017-04-03 -3.170634  0.668142 -1.475780 -1.557222
2017-04-10  0.070320  0.178452  0.174697 -0.788119

5.2 Apply

df.apply()：Applying functions to the data

In[90]: df.apply(np.cumsum)
Out[90]: 
                   A         B         C         D
2017-03-20  0.387798 -0.058203 -0.169525 -0.612571
2017-03-27  0.609578 -1.228301  0.707147 -0.059777
2017-04-03 -0.975739  1.025158  0.816683 -0.031682
2017-04-10 -0.940579  1.168450  0.956220 -0.854961
In[91]: df.apply(lambda x: x.max()-x.min())
Out[91]: 
A    1.973115
B    3.423556
C    1.046196
D    1.376072
dtype: float64

5.3 Histogramming

value_counts() 计数

In[114]: s = pd.Series(np.random.randint(0, 7, size=10))
In[115]: s
Out[115]: 
0    5
1    4
2    6
3    3
4    4
5    6
6    5
7    0
8    6
9    4
dtype: int64
In[117]: s.value_counts()
Out[117]: 
6    3
4    3
5    2
3    1
0    1
dtype: int64

5.4 String Methods

Series.str下面还有很多方法

In[118]: s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
In[121]: s.str.upper()
Out[121]: 
0       A
1       B
2       C
3    AABA
4    BACA
5     NaN
6    CABA
7     DOG
8     CAT
dtype: object

6. Merge

6.1 Concat

默认axis=0,按行拼接,axis=1,按列拼接

In[123]: df = pd.DataFrame(np.random.randn(6,6))
In[124]: df
Out[124]: 
          0         1         2         3         4         5
0 -0.829274 -0.862832  0.799663 -1.115011  2.055959 -0.569186
1  0.058196 -0.571979  0.124036  1.777956 -1.192992 -1.427277
2 -0.442932 -1.177776  0.669681 -1.244613 -0.213186 -2.433752
3  1.305375 -0.540983 -0.232134 -0.373056 -0.036850 -0.171402
4 -0.574777 -0.632868 -0.268940  0.613305  1.000270  1.800234
5  0.874804 -1.711324 -0.240330 -0.183900  0.154792 -1.423335
In[126]: pd.concat([df[:2],df[3:5]])
Out[126]: 
          0         1         2         3         4         5
0 -0.829274 -0.862832  0.799663 -1.115011  2.055959 -0.569186
1  0.058196 -0.571979  0.124036  1.777956 -1.192992 -1.427277
3  1.305375 -0.540983 -0.232134 -0.373056 -0.036850 -0.171402
4 -0.574777 -0.632868 -0.268940  0.613305  1.000270  1.800234
    
In[131]: pd.concat([df.iloc[:,0:2],df.iloc[:,3:5]],axis=1)
Out[131]: 
          0         1         3         4
0 -0.829274 -0.862832 -1.115011  2.055959
1  0.058196 -0.571979  1.777956 -1.192992
2 -0.442932 -1.177776 -1.244613 -0.213186
3  1.305375 -0.540983 -0.373056 -0.036850
4 -0.574777 -0.632868  0.613305  1.000270
5  0.874804 -1.711324 -0.183900  0.154792

6.2 Join

6.3 Append

Append rows to a dataframe. See the Appending

In[132]: df = pd.DataFrame(np.random.randn(8, 4), columns=['A','B','C','D'])
In[133]: df
Out[133]: 
          A         B         C         D
0 -1.077744  0.751278  0.261896  0.738191
1  0.865350 -1.652052  0.090780 -0.022723
2 -0.205837  0.042363 -0.602793 -1.634198
3 -0.232356 -0.905884 -0.495916  0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
5  0.735322 -0.331950 -1.392685 -1.059044
6 -1.156664  0.136152  0.112945 -0.965557
7  0.050109  0.071000 -1.793460  0.071480
In[135]: df.append(df[3:5])
Out[135]: 
          A         B         C         D
0 -1.077744  0.751278  0.261896  0.738191
1  0.865350 -1.652052  0.090780 -0.022723
2 -0.205837  0.042363 -0.602793 -1.634198
3 -0.232356 -0.905884 -0.495916  0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
5  0.735322 -0.331950 -1.392685 -1.059044
6 -1.156664  0.136152  0.112945 -0.965557
7  0.050109  0.071000 -1.793460  0.071480
3 -0.232356 -0.905884 -0.495916  0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
In[136]: df.append(df[3:5], ignore_index=True)
Out[136]: 
          A         B         C         D
0 -1.077744  0.751278  0.261896  0.738191
1  0.865350 -1.652052  0.090780 -0.022723
2 -0.205837  0.042363 -0.602793 -1.634198
3 -0.232356 -0.905884 -0.495916  0.865143
4 -1.218397 -0.091938 -2.044627 -0.688541
5  0.735322 -0.331950 -1.392685 -1.059044
6 -1.156664  0.136152  0.112945 -0.965557
7  0.050109  0.071000 -1.793460  0.071480
8 -0.232356 -0.905884 -0.495916  0.865143
9 -1.218397 -0.091938 -2.044627 -0.688541

6.4 Grouping

df.groupby(by=)

In[137]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ....:                           'foo', 'bar', 'foo', 'foo'],
   ....:                    'B' : ['one', 'one', 'two', 'three',
   ....:                           'two', 'two', 'one', 'three'],
   ....:                    'C' : np.random.randn(8),
   ....:                    'D' : np.random.randn(8)})
In[138]: df
Out[138]: 
     A      B         C         D
0  foo    one -1.378873 -1.289674
1  bar    one -0.800759  1.247083
2  foo    two -1.172841  0.605887
3  bar  three -0.232046  0.328599
4  foo    two -0.027785  1.644086
5  bar    two -0.456277 -1.737294
6  foo    one -0.092326 -0.424813
7  foo  three  0.676421 -0.328490
In[140]: df.groupby(by='A').sum()
Out[140]: 
            C         D
A                      
bar -1.489081 -0.161612
foo -1.995404  0.206996
In[141]: df.groupby(by=['A', 'B']).sum()
Out[141]: 
                  C         D
A   B                        
bar one   -0.800759  1.247083
    three -0.232046  0.328599
    two   -0.456277 -1.737294
foo one   -1.471198 -1.714486
    three  0.676421 -0.328490
    two   -1.200627  2.249972

6.5 Reshaping

Stack

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
   ....:                      'foo', 'foo', 'qux', 'qux'],
   ....:                     ['one', 'two', 'one', 'two',
   ....:                      'one', 'two', 'one', 'two']]))
tuples
Out[27]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index
Out[29]: 
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
Out[31]: 
                     A         B
first second                    
bar   one     0.398046  0.488732
      two    -0.024753 -0.567505
baz   one    -0.509408  0.055228
      two     0.936854  0.326829
foo   one     0.390471  0.540038
      two    -1.001224  0.858414
qux   one    -1.640981  0.103979
      two    -0.167781  0.949106
df[:4]
Out[32]: 
                     A         B
first second                    
bar   one     0.398046  0.488732
      two    -0.024753 -0.567505
baz   one    -0.509408  0.055228
      two     0.936854  0.326829
stacked = df2.stack()
stacked
Out[34]: 
first  second   
bar    one     A   -1.131337
               B    0.734982
       two     A    0.008166
               B   -1.358364
baz    one     A   -1.578985
               B    0.399612
       two     A    1.697060
               B    0.830027
dtype: float64
    
stacked.unstack()
Out[35]: 
                     A         B
first second                    
bar   one    -1.131337  0.734982
      two     0.008166 -1.358364
baz   one    -1.578985  0.399612
      two     1.697060  0.830027
stacked.unstack().unstack()
Out[36]: 
               A                   B          
second       one       two       one       two
first                                         
bar    -1.131337  0.008166  0.734982 -1.358364
baz    -1.578985  1.697060  0.399612  0.830027
stacked.unstack(1)
Out[37]: 
second        one       two
first                      
bar   A -1.131337  0.008166
      B  0.734982 -1.358364
baz   A -1.578985  1.697060
      B  0.399612  0.830027
stacked.unstack(0)
Out[38]: 
first          bar       baz
second                      
one    A -1.131337 -1.578985
       B  0.734982  0.399612
two    A  0.008166  1.697060
       B -1.358364  0.830027
stacked.unstack(2)
Out[39]: 
                     A         B
first second                    
bar   one    -1.131337  0.734982
      two     0.008166 -1.358364
baz   one    -1.578985  0.399612
      two     1.697060  0.830027

6.6 Time Series

resample：对时间序列做类似于groupby的操作

In[163]: rng = pd.date_range('1/1/2012', periods=10, freq='S')
In[164]: ts = pd.Series(np.random.randn(10), index=rng)
In[165]: ts
Out[165]: 
2012-01-01 00:00:00    0.924984
2012-01-01 00:00:01   -1.253491
2012-01-01 00:00:02   -0.417912
2012-01-01 00:00:03   -0.104731
2012-01-01 00:00:04   -1.165980
2012-01-01 00:00:05   -0.535897
2012-01-01 00:00:06   -0.937104
2012-01-01 00:00:07   -0.143732
2012-01-01 00:00:08   -0.292696
2012-01-01 00:00:09   -0.700354
Freq: S, dtype: float64
In[166]: ts.resample('2S').sum()
Out[166]: 
2012-01-01 00:00:00   -0.328507
2012-01-01 00:00:02   -0.522643
2012-01-01 00:00:04   -1.701877
2012-01-01 00:00:06   -1.080836
2012-01-01 00:00:08   -0.993050
Freq: 2S, dtype: float64

to_period,to_timestamp：Timestamped data can be converted to PeriodIndex-ed data using to_period and vice-versa using to_timestamp

In[167]: rng = pd.date_range('1/1/2012', periods=5, freq='M')
In[168]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
In[169]: ts
Out[169]: 
2012-01-31   -0.174429
2012-02-29    2.073488
2012-03-31    1.459442
2012-04-30    0.703198
2012-05-31   -1.796758
Freq: M, dtype: float64
In[175]: ps = ts.to_period()
In[176]: ps
Out[176]: 
2012-01   -0.174429
2012-02    2.073488
2012-03    1.459442
2012-04    0.703198
2012-05   -1.796758
Freq: M, dtype: float64
        
In[177]: ps.to_timestamp()
Out[177]: 
2012-01-01   -0.174429
2012-02-01    2.073488
2012-03-01    1.459442
2012-04-01    0.703198
2012-05-01   -1.796758
Freq: MS, dtype: float64
# Remember that ‘s’ and ‘e’ can be used to return the timestamps at the start or end of the period:
In[179]: ps.to_timestamp('D', 'e')
Out[179]: 
2012-01-31   -0.174429
2012-02-29    2.073488
2012-03-31    1.459442
2012-04-30    0.703198
2012-05-31   -1.796758
Freq: M, dtype: float64

7. Getting Data In/Out

7.1 CSV

df.to_csv()

pd.read_csv()

7.2 Json

to_json

read_json

In[181]: dfj = pd.DataFrame(np.random.randn(5, 2), columns=list('AB'))
In[182]: dfj
Out[182]: 
          A         B
0 -0.570906 -1.082939
1 -0.181702  0.512260
2  0.267922 -0.495724
3  1.182305  0.792193
4 -0.238487 -0.293700
In[184]: json = dfj.to_json()
In[185]: json
Out[185]: 
'{"A":{"0":-0.570906023,"1":-0.1817023093,"2":0.267921959,"3":1.1823048088,"4":-0.2384873419},"B":{"0":-1.082939276,"1":0.5122603476,"2":-0.495723635,"3":0.7921934348,"4":-0.2936996606}}'
In[186]: pd.read_json(json)
Out[186]: 
          A         B
0 -0.570906 -1.082939
1 -0.181702  0.512260
2  0.267922 -0.495724
3  1.182305  0.792193
4 -0.238487 -0.293700

7.3 Excel

df.to_excel

pd.read_excel

In [145]: df.to_excel('foo.xlsx', sheet_name='Sheet1')
In [146]: pd.read_excel('foo.xlsx', 'Sheet1', index_col=None, na_values=['NA'])
Out[146]: 
                    A          B         C          D
2000-01-01   0.266457  -0.399641 -0.219582   1.186860
2000-01-02  -1.170732  -0.345873  1.653061  -0.282953
2000-01-03  -1.734933   0.530468  2.060811  -0.515536
2000-01-04  -1.555121   1.452620  0.239859  -1.156896
2000-01-05   0.578117   0.511371  0.103552  -2.428202
2000-01-06   0.478344   0.449933 -0.741620  -1.962409
2000-01-07   1.235339  -0.091757 -1.543861  -1.084753
...               ...        ...       ...        ...
2002-09-20 -10.628548  -9.153563 -7.883146  28.313940
2002-09-21 -10.390377  -8.727491 -6.399645  30.914107
2002-09-22  -8.985362  -8.485624 -4.669462  31.367740
2002-09-23  -9.558560  -8.781216 -4.499815  30.518439
2002-09-24  -9.902058  -9.340490 -4.386639  30.105593
2002-09-25 -10.216020  -9.480682 -3.933802  29.758560
2002-09-26 -11.856774 -10.671012 -3.216025  29.369368
[1000 rows x 4 columns]